Re: \X is broken

Moritz Lenz

Mar 7, 2009, 6:10:13 AM

to perl5-...@perl.org

Tom Christiansen wrote:
> \X is broken. It is only checking for (?:\PM\pM*),

It better be (?>\PM\pM*) for now (and I think it is).

> I want properties that tell me the width of a grapheme cluster.
> I can't seem to find those. The InHalfwidthAndFullwidthForms
> property is utterly inadequate. I want to know whether a
> cluster takes up 0, 1, or 2 spots on the line. I want to
> know something's "print length". So does printf and format.

Yes, please. That would be *so* helpful for anything that cares about
formatting.

Cheers,
Moritz

Tom Christiansen

Mar 7, 2009, 1:51:09 PM

to Moritz Lenz, Damian Conway, Larry Wall, perl5-...@perl.org

[ Damian && Larry CC'd due to Unicode troubles of
hyphens and syllables, of diphthongs and triphthongs,
and of digraphs and trigraphs. ]

Moritz Lenz wrote on Sat, 07 Mar 2009 12:10:13 +0100:

> Tom Christiansen wrote:

>> \X is broken. It is only checking for (?:\PM\pM*),

> It better be (?>\PM\pM*) for now (and I think it is).

Well, yes, that's better, but it still doesn't address
that the Standard defines a combining character sequence
differently than Perl does:

Combining Character Sequence. A maximal character
sequence consisting of either a base character followed
by a sequence of one or more characters where each is a
combining character, zero width joiner, or zero width
non-joiner; or a sequence of one or more characters
where each is a combining character, zero width joiner,
or zero width non-joiner.

Now, given:

U+200C ZERO WIDTH NON-JOINER
* commonly abbreviated ZWNJ

U+200D ZERO WIDTH JOINER
* commonly abbreviated ZWJ

I read that \X should instead be at least:

(?>\PM[\pM\x{200C}\x{200D}]*)

and, unfortunately, if you read more closely still, even:

(?>\PM[\pM\x{200C}\x{200D}]*|[\pM\x{200C}\x{200D}]+)

which bothers me. The second alternative matches a sequence
that can occur, but is illegal. Blah.

Also, notice they've stated things as one or more. So
technically you'd have to check the length > 1 to see
whether you got a sequence or a single letter. The current
definition is more useful to us, although it needs to have
ZWNJ and ZWJ to bring it up to spec.
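
A minimal sketch of how the three candidate patterns above differ, assuming only core Perl. The helper name count_chunks and the two sample strings are illustrative and not from the original message:

```perl
use strict;
use warnings;

# The three candidate patterns discussed above.
my %pattern = (
    current => qr/(?>\PM\pM*)/,
    joiners => qr/(?>\PM[\pM\x{200C}\x{200D}]*)/,
    full    => qr/(?>\PM[\pM\x{200C}\x{200D}]*|[\pM\x{200C}\x{200D}]+)/,
);

# count_chunks is a hypothetical helper: how many matches a pattern finds.
sub count_chunks {
    my ($name, $text) = @_;
    my $re     = $pattern{$name};
    my @chunks = $text =~ /$re/g;
    return scalar @chunks;
}

my $joined    = "f\x{200D}f";   # f, ZWJ, f
my $defective = "\x{0308}a";    # COMBINING DIAERESIS with no base, then a

for my $name (qw(current joiners full)) {
    printf "%-8s joined=%d defective=%d\n", $name,
        count_chunks($name, $joined), count_chunks($name, $defective);
}
```

The "current" pattern treats the ZWJ as a chunk of its own, and it silently skips the baseless mark rather than matching it. Only the "full" pattern returns the defective sequence as a chunk.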

I stumbled upon this formality when trying to figure out how
to annotate a word to give hints on allowable hyphenation,
that is, where line breaking should or should not be permitted.

Consider these; words in the first column MUST NOT be separated
between the first pair of adjacent vowels in each, while those
in the second column MAY BE thus parted:

    coelacanth      coefficient
    coinmaking      coinmate
    coopering       cooperating
    realize         realign
    reaper          reapply
    reeling         reelect
    reindeer        reindict

To make that distinction, the two traditional ways
of writing the second column are to use either:

    coelacanth      co-efficient
    coinmaking      co-inmate
    coopering       co-operating
    realize         re-align
    reaper          re-apply
    reeling         re-elect
    reindeer        re-indict

or, alternatively, saving some space:

    coelacanth      coëfficient
    coinmaking      coïnmate
    coopering       coöperating
    realize         reälign
    reaper          reäpply
    reeling         reëlect
    reindeer        reïndict

But there are situations where this CANNOT be used, since we
can't use a diaeresis to indicate hiatial disjunction unless
it be over a vowel. Our many consonant pairs have no remedy
to distinguish a unified digraph from mere catenation:

    GH:  roughen             pigheaded
    PH:  symphony            taphouse
    RH:  pseudorheumatic     superhero
    SH:  ashen               gashouse
    TH:  cathedral           cathole
    WH:  nowhere             snowhouse

And we've the SC, CH, SCH cases, too: like antischolastic,
anticholeric, discharge, immiscibility, miscall, miscable,
miscible, miscipher, mischaracterize, mischievous, orchard,
orcherd :-), seneschal, and unschooled. Or, for real fun,
hyperconscientiousness and thermophosphorescence. Try to
autosyllabify those! (Although chthonic and phthisical are
actually more easily autosyllabated than they are said.)

Just DON'T get me started on what to do about demosaïcking,
as in of raw sensor data. :-)

There's not much to do about the consonantal digraphs and
trigraphs unless you resort to explicit hyphenation, which
isn't really appropriate:

    roughen              pig-headed
    symphony             tap-house
    pseudo-rheumatic     super-hero
    ashen                gas-house
    cathedral            cat-hole
    nowhere              snow-house

I had hoped that this were a matter for
which I could use the following:

U+00AD SOFT HYPHEN
= discretionary hyphen
* commonly abbreviated as SHY

but every darned font I find counts it as
a printing character completely equivalent to

U+2010 HYPHEN
x (hyphen-minus - 002D)
x (soft hyphen - 00AD)

so that does no good. It's very annoying: What
good is U+00AD over U+2010 if nothing treats them
differently?

Neither do any of these provide aid:

U+2011 NON-BREAKING HYPHEN
x (hyphen-minus - 002D)
x (soft hyphen - 00AD)
# <noBreak> 2010
U+2027 HYPHENATION POINT
U+2043 HYPHEN BULLET
U+FE63 SMALL HYPHEN-MINUS
# <small> 002D
U+FF0D FULLWIDTH HYPHEN-MINUS

I'd hoped ZWNJ vs ZWJ might be my cure, especially as
those are nicely accounted for in combining character
sequences--but only in the Standard, not in Perl!

Anyway, I don't think any of that works for what I want.

(The FULLWIDTH characters also alerted me to how
it's not just 0 or 1 positions one need worry about,
but even 2 positions. Yargh!)

I think I need to use these:

U+200B ZERO WIDTH SPACE
* commonly abbreviated ZWSP
* this character is intended for line break control;
it has no width, but its presence between two
characters does not prevent increased letter
spacing in justification

U+2060 WORD JOINER
* commonly abbreviated WJ
* a zero width non-breaking space (only)
* intended for disambiguation of functions for byte order mark
x (zero width no-break space - FEFF)

And if so, I suppose I could write:

    co${WJ}elacanth                 co${ZWSP}efficient
    co${WJ}inmaking                 co${ZWSP}inmate
    co${WJ}opering                  co${ZWSP}operating
    re${WJ}alize                    re${ZWSP}align
    re${WJ}aper                     re${ZWSP}apply
    re${WJ}eling                    re${ZWSP}elect
    re${WJ}indeer                   re${ZWSP}indict

    as${WJ}hen                      gas${ZWSP}house
    cat${WJ}hedral                  cat${ZWSP}hole
    no${ZWSP}w${WJ}here             snow${ZWSP}house
    roug${WJ}hen                    pig${ZWSP}headed
    pseudo${ZWSP}r${WJ}heumatic     super${ZWSP}hero
    sym${ZWSP}p${WJ}hony            tap${ZWSP}house

Although that too poses problems of its own. It becomes
more difficult to detect word boundaries, for example.
And as we've seen, nothing seems to pay due attention
to how much width, if any, something takes up.

Plus I'm hardly certain that ZWSP would be duly replaced
by a hyphen at an end-of-line word-break point. Programs
might just part them there without that!
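
A rough sketch of how such annotated words could be consumed, assuming ZWSP marks a permitted break and WJ marks a forbidden one as in the lists above. The helpers break_pieces and plain_word are hypothetical names introduced here, and the explicit hyphen is supplied by the caller because nothing guarantees a renderer would add one:

```perl
use strict;
use warnings;

my $ZWSP = "\x{200B}";   # ZERO WIDTH SPACE: a break is permitted here
my $WJ   = "\x{2060}";   # WORD JOINER: no break here

# Hypothetical helper: split an annotated word at its permitted break points.
sub break_pieces {
    my ($annotated) = @_;
    my @pieces = split /\x{200B}/, $annotated;
    s/\x{2060}//g for @pieces;    # drop the no-break hints from each piece
    return @pieces;
}

# Hypothetical helper: recover the plain spelling without any annotations.
sub plain_word {
    my ($annotated) = @_;
    (my $plain = $annotated) =~ s/[\x{200B}\x{2060}]//g;
    return $plain;
}

my $word = "pseudo${ZWSP}r${WJ}heumatic";
print join("-", break_pieces($word)), "\n";   # hyphen added by hand at each break
print plain_word($word), "\n";
print length($word), " code points, ", length(plain_word($word)), " letters\n";
```

The last line shows the bookkeeping problem: length() counts the invisible annotations, so anything that pads or truncates by length sees a longer word than the reader does.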

It's rather frustrating. I can't seem to find any
useful properties that help with almost any of this.

>> I want properties that tell me the width of a grapheme cluster.
>> I can't seem to find those. The InHalfwidthAndFullwidthForms
>> property is utterly inadequate. I want to know whether a
>> cluster takes up 0, 1, or 2 spots on the line. I want to
>> know something's "print length". So does printf and format.

> Yes, please. That would be *so* helpful for anything that cares
> about formatting.

Quite so!
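
A sketch of what a "print length" helper might look like, assuming a perl new enough (5.12 or later) to expose the East_Asian_Width property through \p{...}. The function names cluster_width and print_width are invented for illustration, and the zero-width list covers only the characters mentioned in this thread:

```perl
use strict;
use warnings;

# Rough estimate of the columns occupied by one grapheme cluster:
# marks and the zero-width characters count 0, East Asian Wide and
# Fullwidth characters count 2, everything else counts 1.
sub cluster_width {
    my ($cluster) = @_;
    my $width = 0;
    for my $ch (split //, $cluster) {
        next if $ch =~ /[\p{M}\x{200B}\x{200C}\x{200D}\x{2060}]/;
        $width += ($ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/)
                ? 2 : 1;
    }
    return $width;
}

# Sum the estimated widths of every cluster that \X finds in the text.
sub print_width {
    my ($text) = @_;
    my $total = 0;
    $total += cluster_width($_) for $text =~ /\X/g;
    return $total;
}

my $diaeresis = "coe\x{0308}fficient";    # e followed by COMBINING DIAERESIS
my $wide      = "\x{FF23}oefficient";     # FULLWIDTH LATIN CAPITAL LETTER C
printf "length=%d width=%d\n", length($diaeresis), print_width($diaeresis);
printf "length=%d width=%d\n", length($wide),      print_width($wide);
```

This is only an approximation: ambiguous-width characters, control characters, and terminal or font behaviour are all ignored.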

--tom

--

#!/usr/bin/perl
use 5.10.0;
use strict;
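use charnames qw[ :full ];

# Emit UTF-8 so the wide characters below print without "Wide character" warnings.
binmode(STDOUT, ":encoding(UTF-8)");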

sub analyse($$);

our $ZWNJ = "\N{ZERO WIDTH NON-JOINER}";
our $ZWJ = "\N{ZERO WIDTH JOINER}";

our $C_DIAR = "\N{COMBINING DIAERESIS}";

our $C_DBLINE = "\N{COMBINING DOUBLE LOW LINE}";
our $C_SGLINE = "\N{COMBINING LOW LINE}";

our $WIDE_C = "\N{FULLWIDTH LATIN CAPITAL LETTER C}";

our %Quotes = (
    "SINGLE QUOTE" => [
        chr(0x2018),    # LEFT SINGLE QUOTATION MARK
        chr(0x2019),    # RIGHT SINGLE QUOTATION MARK
    ],

    "DOUBLE QUOTE" => [
        chr(0x201C),    # LEFT DOUBLE QUOTATION MARK
        chr(0x201D),    # RIGHT DOUBLE QUOTATION MARK
    ],

    "DOUBLE-ANGLE QUOTE" => [
        chr(0x00AB),    # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        chr(0x00BB),    # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ],

    "CORNER" => [
        chr(0x231E),    # BOTTOM LEFT CORNER
        chr(0x231D),    # TOP RIGHT CORNER
    ],
);

our ($LQ, $RQ) = @{ $Quotes{"SINGLE QUOTE"} };
our ($LQQ, $RQQ) = @{ $Quotes{"DOUBLE QUOTE"} };
our ($LA, $RA) = @{ $Quotes{"DOUBLE-ANGLE QUOTE"} };
our ($LK, $RK) = @{ $Quotes{"CORNER"} };

my $word = "coefficient";
our $WLEN = length($word);

analyse("simple", $word);

$word =~ s/e\K/$C_DIAR/;
analyse("diaeresis", $word);

analyse("all caps", uc($word));

$word =~ s/f\K/$ZWJ/g;
analyse("zero width joiners", $word);

$word =~ s/\X\K/$C_SGLINE/g;
analyse("singly underlined", $word);

$word =~ s/$C_SGLINE/$C_DBLINE/g;
analyse("double underlined", $word);

$word =~ s/c/$WIDE_C/;
analyse("wide cap c", $word);

sub analyse($$) {
    my($style, $text) = @_;
    my @chunks = ($text =~ /\X/g);
    print("Word in its ${LQQ}\U$style\E${RQQ} form is ${LQ}$text${RQ}\n");
    printf(" and ${LA}%${WLEN}.${WLEN}s${RA} has %d chunks: ${LK}%s${RK}\n\n",
           $text, scalar(@chunks),
           join (" \N{HYPHENATION POINT} ", @chunks));
}

Karl Williamson

Sep 11, 2009, 5:27:24 PM

to Tom Christiansen, Moritz Lenz, perl5-...@perl.org

> ZWNJ and ZWJ to bring it up to spec.
[snip]

I'm now looking into this problem from the above thread earlier in 2009.
And I don't fully understand this. UAX#29 more simply gives the re
for this as:
base? ( Mark | ZWJ | ZWNJ )+

A base character is NOT precisely \PM, because it also excludes the
separators other than spaces (Zs) and excludes C (except for Private Use, at
the discretion of the implementation). So Perl is wrong here as well.

UAX29 explicitly notes that a single base char is not legal, and that
doesn't make sense to me. Perl is using \X to match a single entity,
and the Unicode definition doesn't apply unless there is something
following it. That doesn't seem useful to me. It means that \X
shouldn't match 'A', or any other ASCII character, unless I'm really
missing something.

It also explicitly states that it can match just a combining sequence
without a base character. But that is not illegal. They call it
defective, but still legal: "A combining mark in a defective combining
character sequence has no associated base character and thus cannot be
said to depend on any particular base character. This is one of the
reasons why fallback processing is required for defective combining
character sequences."

The bottom line is I don't know what \X really should be. I'm thinking
it should require a base character, because when someone uses \X, I
believe they think it's going to match something, like 'A'.

Karl Williamson

Sep 11, 2009, 5:45:31 PM

to Tom Christiansen, Moritz Lenz, perl5-...@perl.org

I looked a little more, and discovered that TR18 says that \X should be
an extended grapheme cluster instead of a combining sequence.
But if I look at the rules for that, I don't see how that gets the job
done, even though the examples claim it does.

Karl Williamson

Sep 12, 2009, 2:02:58 PM

to Tom Christiansen, demerphq, Moritz Lenz, perl5-...@perl.org, Jarkko Hietaniemi

And here is what Unicode says \X should match (from UAX#29):

CR LF | Prpnd* ( Hngl-sylbl | !Cntrl ) ( Grphm_xtnd | Spc_Mark)* | .

These may not mean exactly what you think they do. The CR LF does, but
each of these is defined as part of the Grapheme Cluster Boundary
property, which Perl doesn't currently look at. Cntrl is not identical
to the standard cntrl, but is formulated specifically for this property.
Grphm_xtnd is many of the marks, and includes ZWNJ, ZWJ. So this re
works essentially like our \PM\pM*. Without delving into the old
versions of the standard, my guess is that they found the previous
method using combining sequences to be inadequate as more languages were
added to the standard. I can change mktables to generate tables for the
two unions above, and then regexec.c can be changed later to use them
instead of the current system.
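
One visible difference between the two definitions is the CR LF rule. The sketch below assumes a perl (5.12 or later) whose \X follows the UAX#29 extended grapheme cluster rule; on an older perl the two counts may well agree. The helper names are illustrative only:

```perl
use strict;
use warnings;

# Count chunks under the older base-plus-marks definition.
sub legacy_count {
    my ($text) = @_;
    my @chunks = $text =~ /(?>\PM\pM*)/g;
    return scalar @chunks;
}

# Count chunks under whatever definition of \X this perl implements.
sub cluster_count {
    my ($text) = @_;
    my @chunks = $text =~ /\X/g;
    return scalar @chunks;
}

my $text = "a\r\nb";
printf "legacy pattern: %d chunks\n", legacy_count($text);
printf "\\X:             %d chunks\n", cluster_count($text);
```

Under the legacy pattern CR and LF are two separate chunks, while the UAX#29 rule keeps the pair together as one cluster.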

\b in Perl is also wrong. It looks for alphanumerics, not words. I
looked quickly at regexec.c for this and didn't even find it checking
for underscore, though it must somehow, or someone would have filed a
bug report by now. Now Unicode has a much fancier algorithm to find
word boundaries, using the Word Boundary property, which we may or may
not want to use, but \b should at least be changed to look at words
which are a superset of alphanumerics.

Demerphq

Sep 12, 2009, 2:42:59 PM

to karl williamson, Tom Christiansen, Moritz Lenz, perl5-...@perl.org, Jarkko Hietaniemi

2009/9/12 karl williamson <pub...@khwilliamson.com>:

The reference to isALNUM() is misleading as it is defined as:

handy.h:#define isALNUM(c) (isALPHA(c) || isDIGIT(c) || (c) == '_')

> Now Unicode has a much fancier algorithm to find word boundaries,
> using the Word Boundary property, which we may or may not want to use, but
> \b should at least be changed to look at words which are a superset of
> alphanumerics.

Well, \b's definition and, er, eccentricities are well documented. Are
you sure this is a good idea?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Karl Williamson

Sep 12, 2009, 5:48:54 PM

to demerphq, Tom Christiansen, Moritz Lenz, perl5-...@perl.org, Jarkko Hietaniemi

No, I'm not sure, but I was looking at regexec.c where it goes for
isALNUM_uni in the BOUND case (which \b translates into). Further
digging revealed the following in utf8.c:

    bool
    Perl_is_utf8_alnum(pTHX_ const U8 *p)
    {
    [snip]
        /* NOTE: "IsWord", not "IsAlnum", since Alnum is a true
         * descendant of isalnum(3), in other words, it doesn't
         * contain the '_'. --jhi */
        return is_utf8_common(p, &PL_utf8_alnum, "IsWord");

That 'IsWord' in the last line says I was wrong, and the code is
actually looking at Word instead of alnum.
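
The same conclusion can be checked from the Perl side without reading regexec.c. A small sketch, with an illustrative helper name:

```perl
use strict;
use warnings;

# Returns 1 if "bar" starts at a \b boundary somewhere in the string, else 0.
sub bar_at_boundary {
    my ($text) = @_;
    return $text =~ /\bbar/ ? 1 : 0;
}

printf "foo_bar: %d\n", bar_at_boundary("foo_bar");   # underscore is a word character
printf "foo-bar: %d\n", bar_at_boundary("foo-bar");   # hyphen is not
printf "_ is \\w: %d\n", ("_" =~ /^\w$/ ? 1 : 0);
```

If \b looked only at alphanumerics, "foo_bar" would have a boundary between the underscore and the b. It does not, which agrees with the IsWord lookup above.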

But now I have another question. Just below this is code for looking up
IsAlnumC. There is no table for this, so I would think it should fail.
I didn't find it documented, except mostly in this patch message from
2002:
This patch removes

is_uni_alnumc()
is_uni_alnumc_lc()
is_utf8_alnumc()

from utf8.c.

As best I can tell, they are not used in the core, and likely not used by
anything else, since using them results in an immediate fatal error (as it
has since at least 5.005). They date from a time when is_uni_alnum() was
alphanumerics only (lacking '_').

Jeffrey

------
It appears that this patch didn't get applied. I see a bunch of
references to alnumc scattered through the code.

Rafael Garcia-Suarez

Sep 13, 2009, 7:32:23 AM

to karl williamson, demerphq, Tom Christiansen, Moritz Lenz, perl5-...@perl.org, Jarkko Hietaniemi

2009/9/12 karl williamson <pub...@khwilliamson.com>:

> But now I have another question. Just below this is code for looking up
> IsAlnumC. There is no table for this, so I would think it should fail. I
> didn't find it documented, except mostly in this patch message from 2002:
> This patch removes
>
> is_uni_alnumc()
> is_uni_alnumc_lc()
> is_utf8_alnumc()
>
> from utf8.c.
>
> As best I can tell, they are not used in the core, and likely not used by
> anything else, since using them results in an immediate fatal error (as it
> has since at least 5.005). They date from a time when is_uni_alnum() was
> alphanumerics only (lacking '_').

Right. I've now removed those functions and one global var from blead:

commit 5fba0dddeee4e48144ce1f17a6e372ca4c980087
Author: Rafael Garcia-Suarez <rgarci...@gmail.com>
Date: Sun Sep 13 12:45:47 2009 +0200

Remove obsolete interpreter variable PL_utf8_alnumc

commit 334b0924e9cb80a0a1a60e44bb69faef523ef01c
Author: Rafael Garcia-Suarez <rgarci...@gmail.com>
Date: Sun Sep 13 12:43:40 2009 +0200

Remove obsolete functions is_uni_alnumc, is_uni_alnumc_lc, is_utf8_alnumc